{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `plot_correlation()`: analyze correlations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "The function `plot_correlation()` explores the correlation between columns in various ways and using multiple correlation metrics. The following describes the functionality of `plot_correlation()` for a given dataframe `df`.\n",
    "\n",
    "1. `plot_correlation(df)`: plots correlation matrices (correlations between all pairs of columns)\n",
    "2. `plot_correlation(df, col1)`: plots the most correlated columns to column `col1`\n",
    "3. `plot_correlation(df, col1, col2)`: plots the joint distribution of column `col1` and column `col2` and computes a regression line\n",
    "\n",
    "The following table summarizes the output plots for different settings of `col1` and `col2`.\n",
    "\n",
    "| `col1` | `col2` | Output |\n",
    "| --- | --- | --- |\n",
    "| None | None | *n*\\**n* correlation matrix, computed with [Person](https://www.wikiwand.com/en/Pearson_correlation_coefficien), [Spearman](https://www.wikiwand.com/en/Spearman%27s_rank_correlation_coefficient), and [KendallTau](https://www.wikiwand.com/en/Kendall_rank_correlation_coefficient) correlation coefficients | \n",
    "| Numerical | None |  *n*\\*1 correlation matrix, computed with Pearson, Spearman, and KendallTau correlation coefficients |\n",
    "| Categorical | None | TODO |\n",
    "| Numerical | Numerical | [scatter plot](https://www.wikiwand.com/en/Scatter_plot) with a regression line |\n",
    "| Numerical | Categorical | TODO |\n",
    "| Categorical | Numerical | TODO |\n",
    "| Categorical | Categorical | TODO |\n",
    "\n",
    "Next, we demonstrate the functionality of `plot_correlation()`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the dataset\n",
    "\n",
    "`dataprep.eda` supports **Pandas** and **Dask** dataframes. Here, we will load the well-known [wine quality dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality) into a Pandas dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-18T08:06:34.585634Z",
     "start_time": "2020-11-18T08:06:29.182311Z"
    }
   },
   "outputs": [],
   "source": [
    "from dataprep.datasets import load_dataset\n",
    "df = load_dataset(\"wine-quality-red\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get an overview of the correlations with `plot_correlation(df)`\n",
    "\n",
    "We start by calling `plot_correlation(df)` to compute the statistics and correlation matrices using Pearson, Spearman, and KendallTau correlation coefficients. For the Stats tab, we list four statistics for these three correlation coefficients respectively. Other three tabs are the lower triangular matrices. In each matrix, a cell represents the correlation value between two columns. There is an \"insight\" tab (!) in the upper right-hand corner of each matrix, which shows some insight information. The following shows an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-18T08:06:45.274414Z",
     "start_time": "2020-11-18T08:06:34.596129Z"
    }
   },
   "outputs": [],
   "source": [
    "from dataprep.eda import plot_correlation\n",
    "plot_correlation(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find the columns that are most correlated to column `col1` with `plot_correlation(df, col1)`\n",
    "\n",
    "After computing the correlation matrices, we can discover how other columns correlate to a specific column `x` using `plot_correlation(df, x)`. This function computes the correlation between column `x` and all other columns (using Pearson, Spearman, and KendallTau correlation coefficients), and sorts them in decreasing order. This enables easy determination of the columns that are most positively and negatively correlated with column `x`. The following shows an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-11T03:38:28.053015Z",
     "start_time": "2020-11-11T03:38:27.196819Z"
    }
   },
   "outputs": [],
   "source": [
    "plot_correlation(df, \"alcohol\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore the correlation between two columns with `plot_correlation(df, col1, col2)`\n",
    "\n",
    "Furthermore, `plot_correlation(df, col1, col2)` provides detailed analysis of the correlation between two columns `col1` and `col2`. It plots the joint distribution of the columns `col1` and `col2` as a scatter plot, as well as a regression line. The following shows an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-11T03:38:28.484588Z",
     "start_time": "2020-11-11T03:38:28.059253Z"
    }
   },
   "outputs": [],
   "source": [
    "plot_correlation(df, \"alcohol\", \"pH\")"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}